
Conversation


@rnwang04 rnwang04 commented Sep 24, 2025

Tickets: CVS-173845

@github-actions github-actions bot added the category: continuous batching Continuous batching label Sep 24, 2025
@ceciliapeng2011 ceciliapeng2011 marked this pull request as draft September 24, 2025 05:32
@github-actions github-actions bot added the category: llm_bench Label for tool/llm_bench folder label Sep 26, 2025

@Wovchena Wovchena left a comment


I'll leave C++ review to @vshampor


@Wovchena Wovchena left a comment


Waiting for Vasily

@ceciliapeng2011 ceciliapeng2011 added this to the 2025.4 milestone Oct 14, 2025

@vshampor vshampor left a comment


See my previous comment

@github-actions github-actions bot added the category: WWB PR changes WWB label Oct 16, 2025
@Wovchena Wovchena requested a review from Copilot October 16, 2025 11:23

Copilot AI left a comment


Pull Request Overview

This PR updates the GPU block size configuration to support XAttention, which uses a larger block size (256) compared to the standard GPU block size (16). The changes enable detection of XAttention at runtime and configure the appropriate block size accordingly.

  • Adds XAttention detection logic based on cache dimensions
  • Introduces sparse attention configuration support in benchmarking tools
  • Refactors sparse attention setup into a reusable function

Reviewed Changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| src/cpp/src/continuous_batching/cache_manager.hpp | Adds XAttention detection and sets the GPU block size to 256 when XAttention is enabled |
| tools/who_what_benchmark/whowhatbench/model_loaders.py | Extracts sparse attention configuration into a separate function and adds validation logic |
| tools/llm_bench/llm_bench_utils/ov_utils.py | Adds validation to prevent conflicting sparse attention configuration |
| tools/llm_bench/task/text_generation.py | Moves the GenerationConfig import outside the conditional block for broader scope |
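To make the block-size behavior concrete, here is a minimal Python sketch of the selection logic described above. It is illustrative only: the real implementation is the C++ code in src/cpp/src/continuous_batching/cache_manager.hpp, and the function name and exact detection criterion below are assumptions.

```python
# Hypothetical sketch of the block-size selection described above; names and the
# exact detection criterion are assumptions, not the actual cache_manager.hpp code.
GPU_BLOCK_SIZE_DEFAULT = 16      # legacy GPUs, or XAttention not enabled
GPU_BLOCK_SIZE_XATTENTION = 256  # XAttention on Xe2/Xe3 and newer architectures

def select_gpu_block_size(kv_cache_block_dim: int) -> int:
    """Infer whether XAttention is active from the per-block dimension of the
    compiled KV-cache shape and return the matching GPU block size."""
    if kv_cache_block_dim == GPU_BLOCK_SIZE_XATTENTION:
        return GPU_BLOCK_SIZE_XATTENTION  # XAttention path
    return GPU_BLOCK_SIZE_DEFAULT         # dense / TRISHAPE path
```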



ceciliapeng2011 commented Oct 17, 2025

> The test at
>
> `def test_cache_optimized_generation_is_similar_to_unoptimized(test_struct, apply_rotation, use_sparse_attention):`
>
> must be extended with the XAttention case,

Could do it. But how about a separate PR after CPU xattn is merged? Hope this won't be a blocker for GPU xattn.

> and/or additional tests must be added to demonstrate just what behaviour you had in mind for the case where the user just tries to switch from SparseAttentionMode::TRISHAPE to SparseAttentionMode::XATTENTION without changing his expectations about the block size of 16 that TRISHAPE used to work perfectly with.

If I understand correctly, "SparseAttentionMode" is set at compile time, when users create an LLMPipeline. Switching between TRISHAPE and XATTENTION will trigger the plugin to re-compile the model. Is there any problem with that?

> Also, I guess I'll have to ask again - why is it that only enabling xattention changes the GPU block size implicitly to 128? If it's such a performant block size anyway, why do you not change the GPU block size to 128 everywhere and avoid implicit block size changes of which the user is unaware?

That's a good question. I think you are referring to the GPU KV block size, which is 256. There are two KV block sizes: 256 for XAttention on Xe2/Xe3 and newer GPU architectures, and 16 for legacy GPUs or when XAttention is not used. Sergey once suggested changing the block size to 256 everywhere, and we have a phase-by-phase plan to do that. Considering execution priority and effort, this work will happen after the full XAttention functionality is productized. We've discussed this in the email thread.
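A minimal sketch of the compile-time behavior described here, using the openvino_genai Python API. The SchedulerConfig field names (use_sparse_attention, sparse_attention_config.mode) are assumptions based on this discussion rather than a verified API reference.

```python
# Hypothetical sketch: the sparse attention mode is fixed when the pipeline is
# constructed, so switching TRISHAPE -> XATTENTION means building a new pipeline
# (and re-compiling the model). Field names are assumptions for illustration.
import openvino_genai as ov_genai

scheduler_config = ov_genai.SchedulerConfig()
scheduler_config.use_sparse_attention = True
scheduler_config.sparse_attention_config.mode = ov_genai.SparseAttentionMode.TRISHAPE

# Compiled with TRISHAPE; the GPU KV block size stays at 16.
pipe_trishape = ov_genai.LLMPipeline("model_dir", "GPU", scheduler_config=scheduler_config)

# Switching the mode requires constructing a new pipeline, which re-compiles the
# model; on Xe2/Xe3 the KV block size then becomes 256 for XAttention.
scheduler_config.sparse_attention_config.mode = ov_genai.SparseAttentionMode.XATTENTION
pipe_xattn = ov_genai.LLMPipeline("model_dir", "GPU", scheduler_config=scheduler_config)
```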


peterchen-intel commented Oct 17, 2025

> The test at
>
> `def test_cache_optimized_generation_is_similar_to_unoptimized(test_struct, apply_rotation, use_sparse_attention):`
>
> must be extended with the XAttention case,

> Could do it. But how about a separate PR after CPU xattn is merged? Hope this won't be a blocker for GPU xattn.

@vshampor I think we have to go this way, or else the test case will always fail. Created ticket CVS-175120 to track this.

github-merge-queue bot pushed a commit to openvinotoolkit/openvino that referenced this pull request Oct 23, 2025
### Details:
- *XAttention for FP16 KVCache as a preview feature*
- [x] to add unit tests
- [x] to disable XAttention for legacy platforms (XAttention kernels are implemented for Xe2/Xe3 with CM)
- [x] to streamline the process of xattention. Currently the kvcache shape is used to determine it. Maybe there is a better approach.
- [x] to add warning messages for unsupported cases: multiple subsequences, typo in the kvcache precision, etc.
- [ ] to remove the trivial converter nodes from the xattention_threshold Parameter to the PageAttention input.
- [x] to refactor the xattention kernel impls by reusing RT parameters instead of recomputing them.
- [x] to enable the U8 KVCache path (stretch goal)
- [x] WWB with long prompts

This PR should work along with openvinotoolkit/openvino.genai#2764.

### Tickets:
 - *CVS-173857*

---------

Signed-off-by: Zhai, Xuejun <[email protected]>
Co-authored-by: river.li <[email protected]>
Co-authored-by: Luo Cheng <[email protected]>
Co-authored-by: Li, Tingqian <[email protected]>
Co-authored-by: rnwang04 <[email protected]>
Co-authored-by: Wang Wangwang <[email protected]>
Co-authored-by: Chen Peter <[email protected]>
Co-authored-by: Zhai, Xuejun <[email protected]>
Co-authored-by: Luwei Zhou <[email protected]>
github-merge-queue bot pushed a commit to openvinotoolkit/openvino that referenced this pull request Oct 26, 2025

Same as [PR32064](#32064), commit 14e57f9:

$ git fetch origin pull/32064/head:pr/32064
$ git fetch origin pull/32551/head:pr/32551
$ git diff pr/32551 pr/32064
(empty)

Opened to resolve the commit checks issue of [PR32064](#32064): one commit there (5201cdf) is signed by an unknown author.

https://docs.github.com/en/github/authenticating-to-github/managing-commit-signature-verification/about-commit-signature-verification

---------

Signed-off-by: Zhai, Xuejun <[email protected]>
Signed-off-by: Chen, Peter <[email protected]>
Co-authored-by: river.li <[email protected]>
Co-authored-by: ceciliapeng2011 <[email protected]>
Co-authored-by: Luo Cheng <[email protected]>
Co-authored-by: Li, Tingqian <[email protected]>
Co-authored-by: rnwang04 <[email protected]>
Co-authored-by: Zhai, Xuejun <[email protected]>
Co-authored-by: Luwei Zhou <[email protected]>
Co-authored-by: Wang Wangwang <[email protected]>
@peterchen-intel


@vshampor @ceciliapeng2011 Let's continue the discussion in CVS-175590 (GPU KV block size) and CVS-175120 (test XATTN with CPU)

@peterchen-intel peterchen-intel dismissed vshampor’s stale review October 26, 2025 11:52

Agreed in chat to continue merging this PR and follow up on the discussion.



def get_scheduler_config_genai(cb_config):
def configure_sparse_attention(scheduler_params, scheduler_config):

@peterchen-intel peterchen-intel Oct 26, 2025


#2895 is working toward the same purpose.
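For context, a minimal sketch of what a helper with this purpose might look like. This is hypothetical: parameter names and the validation rule are assumptions, not the actual llm_bench/WWB code.

```python
# Hypothetical helper for the benchmark tools: apply sparse-attention settings from
# user-supplied scheduler params and reject conflicting combinations. Names and the
# validation rule are assumptions, not the actual implementation.
import openvino_genai as ov_genai

def configure_sparse_attention(scheduler_params: dict, scheduler_config: ov_genai.SchedulerConfig):
    mode_name = scheduler_params.get("sparse_attention_mode")
    if mode_name is None:
        return scheduler_config  # nothing to configure

    if scheduler_params.get("use_sparse_attention") is False:
        # Conflicting request: a mode was given while sparse attention is disabled.
        raise ValueError("sparse_attention_mode is set but use_sparse_attention is False")

    scheduler_config.use_sparse_attention = True
    scheduler_config.sparse_attention_config.mode = getattr(ov_genai.SparseAttentionMode, mode_name.upper())
    return scheduler_config
```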
